Capstone project

The data science lifecycle mainly involves:

  1. <a href="#data_collection">Data Collection</a>
  2. <a href="#data_processing">Data Processing</a>
  3. <a href="#data_viz">Exploratory Data Analysis</a>
  4. <a href="#hypothesis_ML">Hypothesis Testing and Machine Learning</a>
  5. <a href="#data_insights">Insights and Policy Decisions</a>

I will use data from a Kaggle Walmart store sales forecasting competition, covering 45 stores, to walk through the lifecycle of a data science project.

The ability to accurately predict weekly sales for any retail store is crucial in planning for inventory and maximizing efficiency to improve customer experience and profits.

Below are the detailed instructions for the competition: you are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
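The five-times weighting above corresponds to a weighted mean absolute error (WMAE), where holiday weeks carry weight 5 and all other weeks weight 1. A minimal sketch of how such a score could be computed (the function name and toy arrays are illustrative, not part of the competition code):

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    """Weighted mean absolute error: holiday weeks weigh 5x non-holiday weeks."""
    weights = np.where(is_holiday, 5.0, 1.0)
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)

# toy example: one holiday week among three, each with an absolute error of 10
print(wmae(np.array([100.0, 200.0, 300.0]),
           np.array([110.0, 190.0, 310.0]),
           np.array([False, True, False])))  # → 10.0
```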

In [173]:
#import some of the libraries needed for the project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bokeh.charts import Line, Scatter, TimeSeries, Bar, show, output_file, vplot
from bokeh.io import output_notebook
output_notebook()
Loading BokehJS ...

1. Data Collection

Features.csv file description: This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

  1. Store - the store number
  2. Date - the week
  3. Temperature - average temperature in the region
  4. Fuel_Price - cost of fuel in the region
  5. MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
  6. CPI - the consumer price index
  7. Unemployment - the unemployment rate
  8. IsHoliday - whether the week is a special holiday week
In [174]:
features = pd.read_csv("features.csv")
features.head()
Out[174]:
Store Date Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment IsHoliday
0 1 2010-02-05 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 False
1 1 2010-02-12 38.51 2.548 NaN NaN NaN NaN NaN 211.242170 8.106 True
2 1 2010-02-19 39.93 2.514 NaN NaN NaN NaN NaN 211.289143 8.106 False
3 1 2010-02-26 46.63 2.561 NaN NaN NaN NaN NaN 211.319643 8.106 False
4 1 2010-03-05 46.50 2.625 NaN NaN NaN NaN NaN 211.350143 8.106 False

Stores.csv file description: This file contains anonymized information about the 45 stores, indicating the type and size of each store.

In [175]:
stores = pd.read_csv("stores.csv")
stores.head()
Out[175]:
Store Type Size
0 1 A 151315
1 2 A 202307
2 3 B 37392
3 4 A 205863
4 5 B 34875

Train.csv file description: This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

  1. Store - the store number
  2. Dept - the department number
  3. Date - the week
  4. Weekly_Sales - sales for the given department in the given store
  5. IsHoliday - whether the week is a special holiday week
In [176]:
train = pd.read_csv("train.csv")
train.head()
Out[176]:
Store Dept Date Weekly_Sales IsHoliday
0 1 1 2010-02-05 24924.50 False
1 1 1 2010-02-12 46039.49 True
2 1 1 2010-02-19 41595.55 False
3 1 1 2010-02-26 19403.54 False
4 1 1 2010-03-05 21827.90 False

Test.csv file description: This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

In [177]:
test = pd.read_csv("test.csv")
test.head()
Out[177]:
Store Dept Date IsHoliday
0 1 1 2012-11-02 False
1 1 1 2012-11-09 False
2 1 1 2012-11-16 False
3 1 1 2012-11-23 True
4 1 1 2012-11-30 False

2. Data Processing

Split the dates in all tables into year, month and day.

In [178]:
# Splits a column of date strings into years, months and days
# parameters: dates - pandas Series containing dates in the format year-month-day
# returns: a tuple of lists in the format (year, month, day)
def split_date(dates):
    year = []
    month = []
    day = []
    for row in dates:
        new_date = row.split("-")
        year.append(int(new_date[0]))
        month.append(int(new_date[1]))
        day.append(int(new_date[2]))

    return (year, month, day)
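For reference, the same split can be done without an explicit loop using pandas' datetime accessor; a small sketch on a toy Series, equivalent in output to split_date:

```python
import pandas as pd

# sample column in year-month-day format
dates = pd.Series(["2010-02-05", "2010-02-12"])
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# the .dt accessor exposes the components as integer Series
year, month, day = parsed.dt.year, parsed.dt.month, parsed.dt.day
print(list(year), list(month), list(day))  # → [2010, 2010] [2, 2] [5, 12]
```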
        
        
    
In [179]:
#Split the dates for the test table into year, month and Day
year, month, day = split_date(test["Date"])
test["Year"] = year
test["Month"] = month
test["Day"] = day
test.head()
Out[179]:
Store Dept Date IsHoliday Year Month Day
0 1 1 2012-11-02 False 2012 11 2
1 1 1 2012-11-09 False 2012 11 9
2 1 1 2012-11-16 False 2012 11 16
3 1 1 2012-11-23 True 2012 11 23
4 1 1 2012-11-30 False 2012 11 30
In [180]:
#Split the dates for the train table into year, month and Day
year, month, day = split_date(train["Date"])
train["Year"] = year
train["Month"] = month
train["Day"] = day
train.head()
Out[180]:
Store Dept Date Weekly_Sales IsHoliday Year Month Day
0 1 1 2010-02-05 24924.50 False 2010 2 5
1 1 1 2010-02-12 46039.49 True 2010 2 12
2 1 1 2010-02-19 41595.55 False 2010 2 19
3 1 1 2010-02-26 19403.54 False 2010 2 26
4 1 1 2010-03-05 21827.90 False 2010 3 5
In [181]:
year, month, day = split_date(features["Date"])
features["Year"] = year
features["Month"] = month
features["Day"] = day
features.head()
Out[181]:
Store Date Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment IsHoliday Year Month Day
0 1 2010-02-05 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 False 2010 2 5
1 1 2010-02-12 38.51 2.548 NaN NaN NaN NaN NaN 211.242170 8.106 True 2010 2 12
2 1 2010-02-19 39.93 2.514 NaN NaN NaN NaN NaN 211.289143 8.106 False 2010 2 19
3 1 2010-02-26 46.63 2.561 NaN NaN NaN NaN NaN 211.319643 8.106 False 2010 2 26
4 1 2010-03-05 46.50 2.625 NaN NaN NaN NaN NaN 211.350143 8.106 False 2010 3 5

We also need to determine whether any relationships exist between our sales data and the features stored in the other tables. To make that analysis easier, I will left join the train table with the stores table and the features table, adding their columns to the training data.

Left join the train (left) table with the features (right) table.
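The left-join semantics can be sketched on toy frames: every row of the left table is kept, and left rows with no match on the join keys receive NaN in the right table's columns (the column values here are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"Store": [1, 1, 2], "Weekly_Sales": [100.0, 150.0, 90.0]})
right = pd.DataFrame({"Store": [1], "Temperature": [42.31]})

# how='left' keeps all three sales rows; Store 2 gets NaN for Temperature
merged = left.merge(right, on="Store", how="left")
print(merged)
```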

In [182]:
train_data = train.merge(features, on =['Store', 'IsHoliday', 'Year', 'Month', 'Day'], how = 'left')
train_data    
Out[182]:
Store Dept Date_x Weekly_Sales IsHoliday Year Month Day Date_y Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment
0 1 1 2010-02-05 24924.50 False 2010 2 5 2010-02-05 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106
1 1 1 2010-02-12 46039.49 True 2010 2 12 2010-02-12 38.51 2.548 NaN NaN NaN NaN NaN 211.242170 8.106
2 1 1 2010-02-19 41595.55 False 2010 2 19 2010-02-19 39.93 2.514 NaN NaN NaN NaN NaN 211.289143 8.106
3 1 1 2010-02-26 19403.54 False 2010 2 26 2010-02-26 46.63 2.561 NaN NaN NaN NaN NaN 211.319643 8.106
4 1 1 2010-03-05 21827.90 False 2010 3 5 2010-03-05 46.50 2.625 NaN NaN NaN NaN NaN 211.350143 8.106
5 1 1 2010-03-12 21043.39 False 2010 3 12 2010-03-12 57.79 2.667 NaN NaN NaN NaN NaN 211.380643 8.106
6 1 1 2010-03-19 22136.64 False 2010 3 19 2010-03-19 54.58 2.720 NaN NaN NaN NaN NaN 211.215635 8.106
7 1 1 2010-03-26 26229.21 False 2010 3 26 2010-03-26 51.45 2.732 NaN NaN NaN NaN NaN 211.018042 8.106
8 1 1 2010-04-02 57258.43 False 2010 4 2 2010-04-02 62.27 2.719 NaN NaN NaN NaN NaN 210.820450 7.808
9 1 1 2010-04-09 42960.91 False 2010 4 9 2010-04-09 65.86 2.770 NaN NaN NaN NaN NaN 210.622857 7.808
10 1 1 2010-04-16 17596.96 False 2010 4 16 2010-04-16 66.32 2.808 NaN NaN NaN NaN NaN 210.488700 7.808
11 1 1 2010-04-23 16145.35 False 2010 4 23 2010-04-23 64.84 2.795 NaN NaN NaN NaN NaN 210.439123 7.808
12 1 1 2010-04-30 16555.11 False 2010 4 30 2010-04-30 67.41 2.780 NaN NaN NaN NaN NaN 210.389546 7.808
13 1 1 2010-05-07 17413.94 False 2010 5 7 2010-05-07 72.55 2.835 NaN NaN NaN NaN NaN 210.339968 7.808
14 1 1 2010-05-14 18926.74 False 2010 5 14 2010-05-14 74.78 2.854 NaN NaN NaN NaN NaN 210.337426 7.808
15 1 1 2010-05-21 14773.04 False 2010 5 21 2010-05-21 76.44 2.826 NaN NaN NaN NaN NaN 210.617093 7.808
16 1 1 2010-05-28 15580.43 False 2010 5 28 2010-05-28 80.44 2.759 NaN NaN NaN NaN NaN 210.896761 7.808
17 1 1 2010-06-04 17558.09 False 2010 6 4 2010-06-04 80.69 2.705 NaN NaN NaN NaN NaN 211.176428 7.808
18 1 1 2010-06-11 16637.62 False 2010 6 11 2010-06-11 80.43 2.668 NaN NaN NaN NaN NaN 211.456095 7.808
19 1 1 2010-06-18 16216.27 False 2010 6 18 2010-06-18 84.11 2.637 NaN NaN NaN NaN NaN 211.453772 7.808
20 1 1 2010-06-25 16328.72 False 2010 6 25 2010-06-25 84.34 2.653 NaN NaN NaN NaN NaN 211.338653 7.808
21 1 1 2010-07-02 16333.14 False 2010 7 2 2010-07-02 80.91 2.669 NaN NaN NaN NaN NaN 211.223533 7.787
22 1 1 2010-07-09 17688.76 False 2010 7 9 2010-07-09 80.48 2.642 NaN NaN NaN NaN NaN 211.108414 7.787
23 1 1 2010-07-16 17150.84 False 2010 7 16 2010-07-16 83.15 2.623 NaN NaN NaN NaN NaN 211.100385 7.787
24 1 1 2010-07-23 15360.45 False 2010 7 23 2010-07-23 83.36 2.608 NaN NaN NaN NaN NaN 211.235144 7.787
25 1 1 2010-07-30 15381.82 False 2010 7 30 2010-07-30 81.84 2.640 NaN NaN NaN NaN NaN 211.369903 7.787
26 1 1 2010-08-06 17508.41 False 2010 8 6 2010-08-06 87.16 2.627 NaN NaN NaN NaN NaN 211.504662 7.787
27 1 1 2010-08-13 15536.40 False 2010 8 13 2010-08-13 87.00 2.692 NaN NaN NaN NaN NaN 211.639421 7.787
28 1 1 2010-08-20 15740.13 False 2010 8 20 2010-08-20 86.65 2.664 NaN NaN NaN NaN NaN 211.603363 7.787
29 1 1 2010-08-27 15793.87 False 2010 8 27 2010-08-27 85.22 2.619 NaN NaN NaN NaN NaN 211.567306 7.787
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
421540 45 98 2012-04-06 778.70 False 2012 4 6 2012-04-06 49.73 3.996 13450.45 NaN 26.59 3363.54 3962.01 190.685171 8.567
421541 45 98 2012-04-13 559.14 False 2012 4 13 2012-04-13 51.83 4.044 4736.94 6047.12 16.68 1355.11 2246.12 190.759596 8.567
421542 45 98 2012-04-20 605.80 False 2012 4 20 2012-04-20 63.13 4.027 9210.90 2667.05 6.25 970.33 2568.22 190.813801 8.567
421543 45 98 2012-04-27 619.41 False 2012 4 27 2012-04-27 53.20 4.004 2872.86 0.03 37.12 74.97 2495.29 190.868006 8.567
421544 45 98 2012-05-04 694.25 False 2012 5 4 2012-05-04 55.21 3.951 11984.62 NaN 47.52 6150.63 1775.54 190.922212 8.567
421545 45 98 2012-05-11 893.60 False 2012 5 11 2012-05-11 61.24 3.889 12611.18 NaN 21.36 1667.39 2313.12 190.976417 8.567
421546 45 98 2012-05-18 745.44 False 2012 5 18 2012-05-18 66.30 3.848 6813.74 NaN 13.86 936.49 2941.55 190.996448 8.567
421547 45 98 2012-05-25 795.94 False 2012 5 25 2012-05-25 67.21 3.798 5370.39 NaN 361.22 1287.62 2461.81 191.002810 8.567
421548 45 98 2012-06-01 874.64 False 2012 6 1 2012-06-01 74.48 3.742 10643.62 48.20 201.56 2599.35 1856.97 191.009171 8.567
421549 45 98 2012-06-08 713.50 False 2012 6 8 2012-06-08 64.30 3.689 8429.61 148.60 76.15 1465.54 7180.97 191.015533 8.567
421550 45 98 2012-06-15 856.35 False 2012 6 15 2012-06-15 71.93 3.620 8148.19 252.70 90.05 1226.11 2942.39 191.029973 8.567
421551 45 98 2012-06-22 622.62 False 2012 6 22 2012-06-22 74.22 3.564 5565.31 222.50 1.16 2149.76 7474.76 191.064610 8.567
421552 45 98 2012-06-29 690.52 False 2012 6 29 2012-06-29 75.22 3.506 3291.36 425.60 NaN 314.88 2255.34 191.099246 8.567
421553 45 98 2012-07-06 659.65 False 2012 7 6 2012-07-06 82.99 3.475 5878.09 169.00 64.36 2996.47 5477.04 191.133883 8.684
421554 45 98 2012-07-13 695.21 False 2012 7 13 2012-07-13 79.97 3.523 2978.08 118.90 22.07 1899.54 5181.94 191.168519 8.684
421555 45 98 2012-07-20 845.30 False 2012 7 20 2012-07-20 78.89 3.567 6682.75 24.24 12.83 2249.85 2288.30 191.167043 8.684
421556 45 98 2012-07-27 657.63 False 2012 7 27 2012-07-27 77.20 3.647 5753.81 167.95 1.23 9181.48 3156.06 191.165566 8.684
421557 45 98 2012-08-03 516.46 False 2012 8 3 2012-08-03 76.58 3.654 24853.05 39.56 17.96 11142.69 2768.32 191.164090 8.684
421558 45 98 2012-08-10 727.49 False 2012 8 10 2012-08-10 78.65 3.722 17868.84 50.60 57.66 2593.93 1890.59 191.162613 8.684
421559 45 98 2012-08-17 500.16 False 2012 8 17 2012-08-17 75.71 3.807 3657.79 6.00 0.30 1630.50 3794.22 191.228492 8.684
421560 45 98 2012-08-24 415.40 False 2012 8 24 2012-08-24 72.62 3.834 7936.20 58.38 22.00 5518.07 2291.97 191.344887 8.684
421561 45 98 2012-08-31 346.04 False 2012 8 31 2012-08-31 75.09 3.867 23641.30 6.00 92.93 6988.31 3992.13 191.461281 8.684
421562 45 98 2012-09-07 352.44 True 2012 9 7 2012-09-07 75.70 3.911 11024.45 12.80 52.63 1854.77 2055.70 191.577676 8.684
421563 45 98 2012-09-14 605.96 False 2012 9 14 2012-09-14 67.87 3.948 11407.95 NaN 4.30 3421.72 5268.92 191.699850 8.684
421564 45 98 2012-09-21 467.30 False 2012 9 21 2012-09-21 65.32 4.038 8452.20 92.28 63.24 2376.38 8670.40 191.856704 8.684
421565 45 98 2012-09-28 508.37 False 2012 9 28 2012-09-28 64.88 3.997 4556.61 20.64 1.50 1601.01 3288.25 192.013558 8.684
421566 45 98 2012-10-05 628.10 False 2012 10 5 2012-10-05 64.89 3.985 5046.74 NaN 18.82 2253.43 2340.01 192.170412 8.667
421567 45 98 2012-10-12 1061.02 False 2012 10 12 2012-10-12 54.47 4.000 1956.28 NaN 7.89 599.32 3990.54 192.327265 8.667
421568 45 98 2012-10-19 760.01 False 2012 10 19 2012-10-19 56.47 3.969 2004.02 NaN 3.18 437.73 1537.49 192.330854 8.667
421569 45 98 2012-10-26 1076.80 False 2012 10 26 2012-10-26 58.85 3.882 4018.91 58.08 100.00 211.94 858.33 192.308899 8.667

421570 rows × 18 columns

Delete the redundant Date_x column (Date_y carries the same information and is kept for time-series analysis).

In [183]:
del train_data['Date_x']
train_data.columns
Out[183]:
Index(['Store', 'Dept', 'Weekly_Sales', 'IsHoliday', 'Year', 'Month', 'Day',
       'Date_y', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2',
       'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment'],
      dtype='object')

Left join the train_data (left) table with the stores (right) table. This adds the store type and size columns, giving us more features with which to analyze our sales.

In [184]:
train_data = train_data.merge(stores, on = 'Store', how = 'left')
train_data.head()
Out[184]:
Store Dept Weekly_Sales IsHoliday Year Month Day Date_y Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Type Size
0 1 1 24924.50 False 2010 2 5 2010-02-05 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
1 1 1 46039.49 True 2010 2 12 2010-02-12 38.51 2.548 NaN NaN NaN NaN NaN 211.242170 8.106 A 151315
2 1 1 41595.55 False 2010 2 19 2010-02-19 39.93 2.514 NaN NaN NaN NaN NaN 211.289143 8.106 A 151315
3 1 1 19403.54 False 2010 2 26 2010-02-26 46.63 2.561 NaN NaN NaN NaN NaN 211.319643 8.106 A 151315
4 1 1 21827.90 False 2010 3 5 2010-03-05 46.50 2.625 NaN NaN NaN NaN NaN 211.350143 8.106 A 151315

Change the column data types to types that allow for easier analysis and manipulation of the data:

  1. Change the Date_y column from string to datetime.
  2. Change the following columns from string to numeric types: Store, Dept, Weekly_Sales, Temperature, Fuel_Price, MarkDown1 to MarkDown5, CPI, Unemployment, Size.
  3. Change the IsHoliday column from boolean to binary values: 1 represents True and 0 represents False.
  4. Change the Store type column from letters to corresponding numbers.
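Steps 3 and 4 could also be done with vectorized pandas operations rather than a row loop; a sketch on a toy frame (a fallback code of 4 covers any unexpected store type):

```python
import pandas as pd

df = pd.DataFrame({"IsHoliday": [False, True], "Type": ["A", "B"]})

# boolean -> 0/1
df["Holiday"] = df["IsHoliday"].astype(int)
# letter -> numeric code, defaulting to 4 for any type other than A/B/C
df["Type_n"] = df["Type"].map({"A": 1, "B": 2, "C": 3}).fillna(4).astype(int)

print(df[["Holiday", "Type_n"]].values.tolist())  # → [[0, 1], [1, 2]]
```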
In [185]:
#1. type changed from string to date_time type
train_data['Date_y'] = pd.to_datetime(train_data['Date_y'], yearfirst = True)

#2. type changed from string to numeric
train_data[['Store','Dept','Weekly_Sales', 'Temperature']] = train_data[['Store','Dept','Weekly_Sales', 'Temperature']].apply(pd.to_numeric)
train_data[['Fuel_Price','CPI','Unemployment', 'Size']] = train_data[['Fuel_Price','CPI','Unemployment', 'Size']].apply(pd.to_numeric)
train_data[['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']] = train_data[['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']].apply(pd.to_numeric)

#3. Change IsHoliday from boolean to binary (1 = True, 0 = False)
#4. Change the Type column from letters to numeric codes (A=1, B=2, C=3, other=4)
holiday = []
types = []
for index, row in train_data.iterrows():
    if row['IsHoliday'] == True:
        holiday.append(1)
    else:
        holiday.append(0)

    if row['Type'] == 'A':
        types.append(1)
    elif row['Type'] == 'B':
        types.append(2)
    elif row['Type'] == 'C':
        types.append(3)
    else:
        types.append(4)

train_data['Holiday'] = holiday
train_data['Type_n'] = types
    
train_data.head()
Out[185]:
Store Dept Weekly_Sales IsHoliday Year Month Day Date_y Temperature Fuel_Price ... MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Type Size Holiday Type_n
0 1 1 24924.50 False 2010 2 5 2010-02-05 42.31 2.572 ... NaN NaN NaN NaN 211.096358 8.106 A 151315 0 1
1 1 1 46039.49 True 2010 2 12 2010-02-12 38.51 2.548 ... NaN NaN NaN NaN 211.242170 8.106 A 151315 1 1
2 1 1 41595.55 False 2010 2 19 2010-02-19 39.93 2.514 ... NaN NaN NaN NaN 211.289143 8.106 A 151315 0 1
3 1 1 19403.54 False 2010 2 26 2010-02-26 46.63 2.561 ... NaN NaN NaN NaN 211.319643 8.106 A 151315 0 1
4 1 1 21827.90 False 2010 3 5 2010-03-05 46.50 2.625 ... NaN NaN NaN NaN 211.350143 8.106 A 151315 0 1

5 rows × 21 columns

3. Exploratory Data Analysis and Visualization

We are going to reduce the dataframe to a random sample of 30,000 rows for easier visualization and manipulation, because the full dataframe requires substantial computing power.
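One note on sampling: passing random_state to DataFrame.sample makes the subsample reproducible across notebook runs. A small sketch on a toy frame (the notebook itself samples train_data):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

# the same seed always selects the same rows
sample = df.sample(n=10, random_state=42)
print(len(sample))  # → 10
print(sample.equals(df.sample(n=10, random_state=42)))  # → True
```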

In [187]:
#save the current dataframe to another variable
train_df = train_data.copy(deep = True)
train_df = train_df.sample(n = 30000)
train_df.head()
Out[187]:
Store Dept Weekly_Sales IsHoliday Year Month Day Date_y Temperature Fuel_Price ... MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Type Size Holiday Type_n
383381 41 19 458.53 False 2011 7 8 2011-07-08 68.49 3.540 ... NaN NaN NaN NaN 193.187845 6.901 A 196321 0 1
218785 23 14 29852.89 False 2012 2 24 2012-02-24 31.12 3.751 ... 10478.33 NaN 7541.62 7320.47 137.341103 4.261 B 114533 0 2
255739 26 92 76982.78 False 2010 7 2 2010-07-02 58.90 2.815 ... NaN NaN NaN NaN 132.521867 8.445 A 152513 0 1
203898 21 55 11085.52 False 2011 2 18 2011-02-18 59.87 3.045 ... NaN NaN NaN NaN 212.903312 8.028 B 140167 0 2
106617 11 90 44244.03 False 2011 1 28 2011-01-28 51.04 3.010 ... NaN NaN NaN NaN 215.506582 7.551 A 207499 0 1

5 rows × 21 columns

In [79]:
t = TimeSeries(train_df,
    x='Date_y', y=['Weekly_Sales'],
    title="Timeseries", ylabel='Weekly Sales', legend=True)

output_file("weeklysales.html")
show(t)
INFO:bokeh.core.state:Session output file 'weeklysales.html' already exists, will be overwritten.

The graph above clearly shows spikes in sales at specific times of the year.

Draw a bar chart of the average monthly sales for Walmart stores.

In [80]:
#create dataframe to hold the mean monthly sales
monthly_sales = train_df.groupby(['Month'])['Weekly_Sales'].mean()
monthly_sales = monthly_sales.to_frame()
monthly_sales
Out[80]:
Weekly_Sales
Month
1 14017.301816
2 16018.929659
3 15268.072106
4 15381.758817
5 15428.964589
6 16596.437423
7 16235.490663
8 15739.737647
9 14969.090956
10 15511.682713
11 17603.914825
12 19527.915000
In [81]:
monthly_sales_bar = Bar(monthly_sales, values='Weekly_Sales', title="Mean Monthly Sales")
output_file("Mean_monthly_sales.html")
show(monthly_sales_bar)
INFO:bokeh.core.state:Session output file 'Mean_monthly_sales.html' already exists, will be overwritten.

From the data, December has higher average weekly sales than any other month.

Monthly sales by type of store

In [84]:
monthly_sales_bytype = train_df.groupby(['Month','Type'])['Weekly_Sales'].mean()
monthly_sales_bytype = monthly_sales_bytype.to_frame() 
monthly_sales_bytype.reset_index(inplace=True)
monthly_sales_bytype.head()
Out[84]:
Month Type Weekly_Sales
0 1 A 17777.018195
1 1 B 10224.881230
2 1 C 9292.402321
3 2 A 20276.177580
4 2 B 12012.089956
5 2 C 8552.967925
6 3 A 19510.110634
7 3 B 11713.476235
8 3 C 7976.744924
9 4 A 19746.102259
10 4 B 11215.457309
11 4 C 10106.092667
12 5 A 19011.113808
13 5 B 12314.155379
14 5 C 8551.056748
15 6 A 20726.445759
16 6 B 13080.321787
17 6 C 9050.863011
18 7 A 21163.845186
19 7 B 11555.882283
20 7 C 10040.736786
21 8 A 19952.564372
22 8 B 12066.830094
23 8 C 8466.789358
24 9 A 18281.542998
25 9 B 11263.958343
26 9 C 10959.555367
27 10 A 19600.998379
28 10 B 12055.412274
29 10 C 9418.525828
30 11 A 20980.832792
31 11 B 15203.998102
32 11 C 8927.127771
33 12 A 23827.419050
34 12 B 16539.451523
35 12 C 9640.071053
In [85]:
monthly_sales_bytype_bar = Bar(monthly_sales_bytype, label = 'Month', values='Weekly_Sales', group = 'Type',  title="Mean Monthly Sales By Type of Store")
output_file("Mean_monthly_sales_bytype.html")
show(monthly_sales_bytype_bar)
INFO:bokeh.core.state:Session output file 'Mean_monthly_sales_bytype.html' already exists, will be overwritten.

From the bar chart above, it is clear that stores of type A have the highest average weekly sales, followed by stores of type B, then type C.

In [90]:
p = Scatter(train_df, x='Temperature', y='Weekly_Sales',  title="Temperature vs Weekly Sales",
            xlabel="Temperature in Degrees Farenheit", ylabel="Weekly Sales")

output_file("temperature_sales_scatter.html")

show(p)
INFO:bokeh.core.state:Session output file 'temperature_sales_scatter.html' already exists, will be overwritten.

From the scatter plot above there is no clear relationship between temperature and weekly sales.

Is there a relationship between Walmart store size and weekly sales?

In [97]:
store_size = train_df.groupby(['Size'])["Weekly_Sales"].mean()
store_size = store_size.to_frame()
store_size.reset_index(inplace=True)
store_size.head()
Out[97]:
Size Weekly_Sales
0 34875 5059.297316
1 37392 5952.538877
2 39690 7811.634458
3 39910 7904.774241
4 41062 14137.648702
In [102]:
p = Scatter(store_size, x='Size', y='Weekly_Sales',  title="Store size vs Weekly Sales",
            xlabel="Store size in sqft", ylabel="Weekly Sales")

output_file("store_size_sales_scatter.html")

show(p)
INFO:bokeh.core.state:Session output file 'store_size_sales_scatter.html' already exists, will be overwritten.

From the scatter plot there seems to be a linear relationship between store size and average weekly sales.

Fuel_Price, CPI, Unemployment

Is there a relationship between fuel price and Weekly sales?

In [101]:
p = Scatter(train_df, x='Fuel_Price', y='Weekly_Sales',  title="Fuel Price vs Weekly Sales",
            xlabel="Fuel Price in dollars/gallon", ylabel="Weekly Sales")

output_file("fuel_price_sales_scatter.html")

show(p)
INFO:bokeh.core.state:Session output file 'fuel_price_sales_scatter.html' already exists, will be overwritten.

Conclusion: From the plot above there is no apparent relationship between fuel price and weekly sales.

Is there a relationship between Consumer Price Index(CPI) and Weekly sales?

In [103]:
p = Scatter(train_df, x='CPI', y='Weekly_Sales',  title="Consumer Price Index vs Weekly Sales",
            xlabel="Consumer Price Index", ylabel="Weekly Sales")

output_file("CPI_sales_scatter.html")

show(p)

From the scatter plot above there is no apparent relationship between the Consumer Price Index and weekly sales.

Is there a relationship between Unemployment and Weekly sales?

In [104]:
p = Scatter(train_df, x='Unemployment', y='Weekly_Sales',  title="Unemployment vs Weekly Sales",
            xlabel="Unemployment", ylabel="Weekly Sales")

output_file("unemployment_sales_scatter.html")

show(p)

From the scatter plot above there is no apparent relationship between unemployment and weekly sales.

Is there a relationship between holidays and weekly sales?

In [113]:
holiday_df = train_df.groupby(['Holiday'])['Weekly_Sales'].mean()
holiday_df = holiday_df.to_frame()
holiday_df.reset_index(inplace=True)
holiday_df.head()
Out[113]:
Holiday Weekly_Sales
0 0 15928.847907
1 1 16482.231966
In [114]:
holiday_df_bar = Bar(holiday_df, values='Weekly_Sales',  title="Mean Holiday Sales")
output_file("Mean_holiday_sales.html")
show(holiday_df_bar)
INFO:bokeh.core.state:Session output file 'Mean_holiday_sales.html' already exists, will be overwritten.

Conclusion: From the bar graph above, holiday weeks experience higher sales than non-holiday weeks.

Is there a relationship between Store department and weekly sales?

In [119]:
dept_df = train_df.groupby(['Dept'])['Weekly_Sales'].mean()
dept_df = dept_df.to_frame()
dept_df.reset_index(inplace=True)
dept_df.head()
Out[119]:
Dept Weekly_Sales
0 1 20089.599784
1 2 42736.740112
2 3 12796.384205
3 4 26186.560693
4 5 20877.189643
In [120]:
dept_df_bar = Bar(dept_df, values='Weekly_Sales',  title="Mean Department Sales")
output_file("Mean_department_sales.html")
show(dept_df_bar)
INFO:bokeh.core.state:Session output file 'Mean_department_sales.html' already exists, will be overwritten.

From the bar chart above there seems to be a strong relationship between store department and weekly sales.

Summary of Exploratory Data Analysis

Store size, holiday weeks, store type, store department and month of the year have a strong effect on weekly sales, while other factors such as temperature, the Consumer Price Index (CPI), fuel price and unemployment have little to no impact on weekly sales.

4. Analysis, Hypothesis testing and Machine Learning

In [188]:
#import libraries that will be used for machine learning
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

From the exploratory data analysis, store size, holiday weeks, store type, store department and month of the year have an impact on a store's weekly sales, so we will only consider these factors when predicting weekly sales. I will use a KNN regressor to predict future weekly sales: it first finds the nearest neighbors of the point to be predicted, based on store, department, holiday, month and year, and then combines the weekly sales of those neighbors (here, as a distance-weighted average) to produce the prediction.

Below is a simple example of a KNN regressor prediction.

In [124]:
#sample input X and y
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

#knn regressor model that takes n parameters
neigh = KNeighborsRegressor(n_neighbors=2)
neigh.fit(X, y)

#predict value based on model
print(neigh.predict([[1.5]]))
[ 0.5]
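Since we will pass weights='distance' later, here is a small sketch of how distance weighting changes a KNN prediction: closer neighbors contribute proportionally more (toy data):

```python
from sklearn.neighbors import KNeighborsRegressor

X = [[0], [1], [3]]
y = [0.0, 10.0, 40.0]

uniform = KNeighborsRegressor(n_neighbors=2).fit(X, y)
weighted = KNeighborsRegressor(n_neighbors=2, weights='distance').fit(X, y)

# the 2 nearest neighbors of 2.5 are x=1 (distance 1.5) and x=3 (distance 0.5)
print(uniform.predict([[2.5]]))   # plain mean of 10 and 40 → 25.0
print(weighted.predict([[2.5]]))  # the closer neighbor dominates → 32.5
```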

How many nearest neighbors should we use in the regression? I will measure the accuracy of the weekly sales predictions for a range of neighborhood sizes and choose the optimum number of nearest neighbors, using K-fold cross-validation to estimate the accuracy of the predictions.
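For intuition, K-fold cross-validation partitions the rows into k folds, trains on k-1 folds and scores on the held-out fold, rotating through all folds. A minimal sketch of the splits produced by scikit-learn's KFold (toy data):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(-1, 1)
kf = KFold(n_splits=3)

# each of the 3 folds serves as the test set exactly once
splits = [(list(train_idx), list(test_idx)) for train_idx, test_idx in kf.split(X)]
print(splits)
```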

In [160]:
# X values represent the values used to predict the weekly sales
# y is the weekly sales which is value to be predicted
X = train_df[['Store', 'Dept', 'IsHoliday', 'Year', 'Month']]
y = train_df[['Weekly_Sales']]

#array keeps track of the score of each size of nearest neighbor
scores = []
for n in range(1,12):
    neigh = KNeighborsRegressor(n_neighbors=n, weights = 'distance')
    score = cross_val_score(neigh, X, y, cv = 10)
    scores.append([n, score.mean()])

#convert array to dataframe 
scores = pd.DataFrame(data = scores,columns = ['neighbors', 'cross_val_score'])
scores
Out[160]:
neighbors cross_val_score
0 1 0.469270
1 2 0.603828
2 3 0.625834
3 4 0.629009
4 5 0.627595
5 6 0.620611
6 7 0.612254
7 8 0.604367
8 9 0.597857
9 10 0.591605
10 11 0.587781

Plot the results from the table to show the optimum number of nearest neighbors.

In [163]:
line = Line(scores, x='neighbors',y='cross_val_score', title="K-Nearest neighbors vs cross validation score", legend="top_left", xlabel = 'k-nearest neighbor', ylabel='cross validation score')
output_file("neighbors_score.html")
show(line)

Conclusion: From the graph above, the cross-validation score peaks at 4 nearest neighbors.

What is the optimum number of folds for K-fold cross-validation? The first step is to measure the accuracy of the predictions using different numbers of folds.

In [155]:
scores_kfold = []
for kfold_size in range(3,30):
    neigh = KNeighborsRegressor(n_neighbors=4, weights = 'distance')
    score = cross_val_score(neigh, X, y, cv = kfold_size)
    
    scores_kfold.append([kfold_size, score.mean()])
    
scores_kfold_df = pd.DataFrame(data = scores_kfold,columns = ['kfold_size', 'cross_val_score'])
scores_kfold_df
Out[155]:
kfold_size cross_val_score
0 3 0.578559
1 4 0.597466
2 5 0.610447
3 6 0.621875
4 7 0.623241
5 8 0.627965
6 9 0.633157
7 10 0.629009
8 11 0.639217
9 12 0.638106
10 13 0.632692
11 14 0.643557
12 15 0.639783
13 16 0.641407
14 17 0.641564
15 18 0.641751
16 19 0.650520
17 20 0.650651
18 21 0.642341
19 22 0.648534
20 23 0.648691
21 24 0.645947
22 25 0.648927
23 26 0.643594
24 27 0.650133
25 28 0.649422
26 29 0.648745

The second step is to plot the results.

In [164]:
line = Line(scores_kfold_df, x='kfold_size',y='cross_val_score', title="Kfold sizes vs cross validation score", legend="top_left", xlabel = 'kfold sizes', ylabel='cross validation score')
output_file("kfold_sizes_score.html")
show(line)
INFO:bokeh.core.state:Session output file 'kfold_sizes_score.html' already exists, will be overwritten.

Observation: the score peaks when the number of folds reaches 20, and the graph then levels off. Conclusion: use 20 folds for cross-validation, because increasing the number of folds beyond 20 yields no significant improvement in the cross-validation score.

5. Insights and Policy Decisions

In [189]:
from IPython.display import HTML

We can now use our K-nearest neighbors regressor to predict future weekly sales.

In [166]:
# X values represent the values used to predict the weekly sales
# y is the weekly sales which is value to be predicted
X = train_df[['Store', 'Dept', 'IsHoliday', 'Year', 'Month']]
y = train_df[['Weekly_Sales']]
neigh = KNeighborsRegressor(n_neighbors=4, weights = 'distance')
neigh.fit(X, y) 

X_test = test[['Store', 'Dept', 'IsHoliday', 'Year', 'Month']]
test_predict = neigh.predict(X_test)
test_predict
Out[166]:
array([[ 30053.73380352],
       [ 30053.73380352],
       [ 30053.73380352],
       ..., 
       [  2551.75801881],
       [  2551.75801881],
       [  2551.75801881]])
In [167]:
test['Weekly_Sales_Predictions'] = test_predict
test
Out[167]:
Store Dept Date IsHoliday Year Month Day Weekly_Sales_Predictions
0 1 1 2012-11-02 False 2012 11 2 30053.733804
1 1 1 2012-11-09 False 2012 11 9 30053.733804
2 1 1 2012-11-16 False 2012 11 16 30053.733804
3 1 1 2012-11-23 True 2012 11 23 29324.176258
4 1 1 2012-11-30 False 2012 11 30 30053.733804
5 1 1 2012-12-07 False 2012 12 7 34103.658397
6 1 1 2012-12-14 False 2012 12 14 34103.658397
7 1 1 2012-12-21 False 2012 12 21 34103.658397
8 1 1 2012-12-28 True 2012 12 28 33444.921287
9 1 1 2013-01-04 False 2013 1 4 34611.266982
10 1 1 2013-01-11 False 2013 1 11 34611.266982
11 1 1 2013-01-18 False 2013 1 18 34611.266982
12 1 1 2013-01-25 False 2013 1 25 34611.266982
13 1 1 2013-02-01 False 2013 2 1 36753.528076
14 1 1 2013-02-08 True 2013 2 8 37179.804734
15 1 1 2013-02-15 False 2013 2 15 36753.528076
16 1 1 2013-02-22 False 2013 2 22 36753.528076
17 1 1 2013-03-01 False 2013 3 1 36520.935839
18 1 1 2013-03-08 False 2013 3 8 36520.935839
19 1 1 2013-03-15 False 2013 3 15 36520.935839
20 1 1 2013-03-22 False 2013 3 22 36520.935839
21 1 1 2013-03-29 False 2013 3 29 36520.935839
22 1 1 2013-04-05 False 2013 4 5 38628.377350
23 1 1 2013-04-12 False 2013 4 12 38628.377350
24 1 1 2013-04-19 False 2013 4 19 38628.377350
25 1 1 2013-04-26 False 2013 4 26 38628.377350
26 1 1 2013-05-03 False 2013 5 3 33620.723173
27 1 1 2013-05-10 False 2013 5 10 33620.723173
28 1 1 2013-05-17 False 2013 5 17 33620.723173
29 1 1 2013-05-24 False 2013 5 24 33620.723173
... ... ... ... ... ... ... ... ...
115034 45 98 2013-01-04 False 2013 1 4 2395.928406
115035 45 98 2013-01-11 False 2013 1 11 2395.928406
115036 45 98 2013-01-18 False 2013 1 18 2395.928406
115037 45 98 2013-01-25 False 2013 1 25 2395.928406
115038 45 98 2013-02-01 False 2013 2 1 1307.559346
115039 45 98 2013-02-08 True 2013 2 8 1335.415740
115040 45 98 2013-02-15 False 2013 2 15 1307.559346
115041 45 98 2013-02-22 False 2013 2 22 1307.559346
115042 45 98 2013-03-01 False 2013 3 1 1721.972220
115043 45 98 2013-03-08 False 2013 3 8 1721.972220
115044 45 98 2013-03-15 False 2013 3 15 1721.972220
115045 45 98 2013-03-22 False 2013 3 22 1721.972220
115046 45 98 2013-03-29 False 2013 3 29 1721.972220
115047 45 98 2013-04-05 False 2013 4 5 2067.940000
115048 45 98 2013-04-12 False 2013 4 12 2067.940000
115049 45 98 2013-04-19 False 2013 4 19 2067.940000
115050 45 98 2013-04-26 False 2013 4 26 2067.940000
115051 45 98 2013-05-03 False 2013 5 3 2499.883025
115052 45 98 2013-05-10 False 2013 5 10 2499.883025
115053 45 98 2013-05-17 False 2013 5 17 2499.883025
115054 45 98 2013-05-24 False 2013 5 24 2499.883025
115055 45 98 2013-05-31 False 2013 5 31 2499.883025
115056 45 98 2013-06-07 False 2013 6 7 2640.197890
115057 45 98 2013-06-14 False 2013 6 14 2640.197890
115058 45 98 2013-06-21 False 2013 6 21 2640.197890
115059 45 98 2013-06-28 False 2013 6 28 2640.197890
115060 45 98 2013-07-05 False 2013 7 5 2551.758019
115061 45 98 2013-07-12 False 2013 7 12 2551.758019
115062 45 98 2013-07-19 False 2013 7 19 2551.758019
115063 45 98 2013-07-26 False 2013 7 26 2551.758019

115064 rows × 8 columns

From the exploratory data analysis we realised that stores of type A produce significantly higher sales than type B and type C stores. In order to increase sales, we should consider focusing our efforts on type A stores, since they are the biggest revenue driver.

In [169]:
HTML(filename='Mean_monthly_sales_bytype.html')
Out[169]:
Bokeh Plot

Monthly sales also vary over the year. We should expect our highest sales during the months of November and December, while our lowest sales will probably come in the months of January and September.

In [170]:
HTML(filename='Mean_monthly_sales.html')
Out[170]:
Bokeh Plot

The scatter plot below shows that sales tend to increase with store size. Management should expect higher sales from larger stores and can use the data to determine which stores use their size efficiently to drive weekly sales.

In [171]:
HTML(filename="store_size_sales_scatter.html")
Out[171]:
Bokeh Plot